LightGBM

LightGBM is an open-source, distributed, high-performance gradient boosting (GBDT, GBRT, GBM, or MART) framework. It specializes in building high-quality, GPU-enabled decision tree algorithms for ranking, classification, and many other machine learning tasks. LightGBM is part of Microsoft's DMTK project.

Advantages of LightGBM

  • Composability: LightGBM models can be incorporated into existing SparkML Pipelines and used for batch, streaming, and serving workloads (see the sketch after this list).
  • Performance: LightGBM on Spark is 10-30% faster than SparkML on the Higgs dataset and achieves a 15% increase in AUC. Parallel experiments have verified that LightGBM can achieve a linear speed-up by using multiple machines for training in specific settings.
  • Functionality: LightGBM offers a wide array of tunable parameters that one can use to customize their decision tree system. LightGBM on Spark also supports new types of problems such as quantile regression.
  • Cross platform: LightGBM on Spark is available on Spark, PySpark, and SparklyR.
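
To illustrate the composability point, a LightGBM estimator can be dropped into a standard SparkML Pipeline alongside other stages. A minimal sketch (the column names f1, f2, f3, and label, and the training_df DataFrame, are hypothetical):

from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from synapse.ml.lightgbm import LightGBMClassifier

# Assemble raw columns into a feature vector, then train LightGBM in one pipeline
assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
classifier = LightGBMClassifier(featuresCol="features", labelCol="label")
pipeline = Pipeline(stages=[assembler, classifier])
# pipeline_model = pipeline.fit(training_df)  # training_df: a hypothetical DataFrame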

LightGBM Usage

  • LightGBMClassifier: used for building classification models. For example, to predict whether a company will go bankrupt, we could build a binary classification model with LightGBMClassifier.
  • LightGBMRegressor: used for building regression models. For example, to predict house prices, we could build a regression model with LightGBMRegressor.
  • LightGBMRanker: used for building ranking models. For example, to predict the relevance of website search results, we could build a ranking model with LightGBMRanker.

Bankruptcy Prediction with LightGBM Classifier

In this example, we use LightGBM to build a classification model in order to predict bankruptcy.

Read dataset

from synapse.ml.core.platform import *
df = (
    spark.read.format("csv")
    .option("header", True)
    .option("inferSchema", True)
    .load(
        "wasbs://publicwasb@mmlspark.blob.core.windows.net/company_bankruptcy_prediction_data.csv"
    )
)
# print dataset size
print("records read: " + str(df.count()))
print("Schema: ")
df.printSchema()
display(df)

Split the dataset into train and test

train, test = df.randomSplit([0.85, 0.15], seed=1)

Add a featurizer to convert features to a vector

from pyspark.ml.feature import VectorAssembler

# The first column ("Bankrupt?") is the label; all remaining columns are features
feature_cols = df.columns[1:]
featurizer = VectorAssembler(inputCols=feature_cols, outputCol="features")
train_data = featurizer.transform(train).select("Bankrupt?", "features")
test_data = featurizer.transform(test).select("Bankrupt?", "features")

Check if the data is unbalanced

display(train_data.groupBy("Bankrupt?").count())
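
If the label counts are heavily skewed, the isUnbalance flag used during training below tells LightGBM to compensate. A minimal sketch for quantifying the skew explicitly (assumes the label takes the values 0 and 1):

# Compute the negative:positive ratio from the grouped counts
counts = {r["Bankrupt?"]: r["count"] for r in train_data.groupBy("Bankrupt?").count().collect()}
print("negative:positive ratio = %.1f : 1" % (counts[0] / counts[1]))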

Model Training

from synapse.ml.lightgbm import LightGBMClassifier

# isUnbalance=True compensates for the skewed class distribution seen above
model = LightGBMClassifier(
    objective="binary", featuresCol="features", labelCol="Bankrupt?", isUnbalance=True
)
model = model.fit(train_data)

"saveNativeModel" allows you to extract the underlying lightGBM model for fast deployment after you train on Spark.

from synapse.ml.lightgbm import LightGBMClassificationModel

# Save the native model to a platform-appropriate path, then reload it
if running_on_synapse():
    model.saveNativeModel("/models/lgbmclassifier.model")
    model = LightGBMClassificationModel.loadNativeModelFromFile(
        "/models/lgbmclassifier.model"
    )
elif running_on_synapse_internal():
    model.saveNativeModel("Files/models/lgbmclassifier.model")
    model = LightGBMClassificationModel.loadNativeModelFromFile(
        "Files/models/lgbmclassifier.model"
    )
else:
    model.saveNativeModel("/tmp/lgbmclassifier.model")
    model = LightGBMClassificationModel.loadNativeModelFromFile(
        "/tmp/lgbmclassifier.model"
    )
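
The saved file is LightGBM's native model format, so it can typically also be loaded outside Spark for single-node scoring. A sketch, assuming the /tmp path from the last branch above and that the lightgbm Python package is installed:

import lightgbm as lgb

# Load the native model file into a single-node Booster
booster = lgb.Booster(model_file="/tmp/lgbmclassifier.model")
# booster.predict(rows) then scores a NumPy array of feature rows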

Feature Importances Visualization

import pandas as pd
import matplotlib.pyplot as plt

feature_importances = model.getFeatureImportances()
fi = pd.Series(feature_importances, index=feature_cols)
fi = fi.sort_values(ascending=True)
f_index = fi.index
f_values = fi.values

# print feature importances
print("f_index:", f_index)
print("f_values:", f_values)

# plot
x_index = list(range(len(fi)))
x_index = [x / len(fi) for x in x_index]
plt.rcParams["figure.figsize"] = (20, 20)
plt.barh(
    x_index, f_values, height=0.028, align="center", color="tan", tick_label=f_index
)
plt.xlabel("importances")
plt.ylabel("features")
plt.show()

Model Prediction

predictions = model.transform(test_data)
predictions.limit(10).toPandas()
from synapse.ml.train import ComputeModelStatistics

metrics = ComputeModelStatistics(
    evaluationMetric="classification",
    labelCol="Bankrupt?",
    scoredLabelsCol="prediction",
).transform(predictions)
display(metrics)
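
The metrics above are computed from the hard label predictions; to compute AUC from the raw scores, you can use Spark's built-in evaluator. A sketch, assuming the classifier's default rawPrediction output column is present:

from pyspark.ml.evaluation import BinaryClassificationEvaluator

evaluator = BinaryClassificationEvaluator(
    labelCol="Bankrupt?", rawPredictionCol="rawPrediction", metricName="areaUnderROC"
)
print("AUC:", evaluator.evaluate(predictions))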

Quantile Regression for Drug Discovery with LightGBMRegressor

In this example, we show how to use LightGBM to build a quantile regression model on a drug discovery dataset: rather than the conditional mean, the model predicts a chosen conditional quantile of the target.

Read dataset

triazines = spark.read.format("libsvm").load(
    "wasbs://publicwasb@mmlspark.blob.core.windows.net/triazines.scale.svmlight"
)
# print some basic info
print("records read: " + str(triazines.count()))
print("Schema: ")
triazines.printSchema()
display(triazines.limit(10))

Split dataset into train and test

train, test = triazines.randomSplit([0.85, 0.15], seed=1)

Model Training

from synapse.ml.lightgbm import LightGBMRegressor

# objective="quantile" with alpha=0.2 fits a model for the 20th percentile
model = LightGBMRegressor(
    objective="quantile", alpha=0.2, learningRate=0.3, numLeaves=31
).fit(train)
print(model.getFeatureImportances())
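
Since alpha selects the quantile being fit, training two models at complementary quantiles gives an approximate prediction interval. A minimal sketch (the 0.2/0.8 pair and the "p20"/"p80" column names are illustrative):

# Fit lower and upper quantile models, then score the test set with both
lower = LightGBMRegressor(objective="quantile", alpha=0.2, predictionCol="p20").fit(train)
upper = LightGBMRegressor(objective="quantile", alpha=0.8, predictionCol="p80").fit(train)
interval = upper.transform(lower.transform(test))  # adds p20, then p80
display(interval.select("label", "p20", "p80"))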

Model Prediction

scoredData = model.transform(test)
display(scoredData)
from synapse.ml.train import ComputeModelStatistics

metrics = ComputeModelStatistics(
    evaluationMetric="regression", labelCol="label", scoresCol="prediction"
).transform(scoredData)
display(metrics)

LightGBM Ranker

Read dataset

df = spark.read.format("parquet").load(
    "wasbs://publicwasb@mmlspark.blob.core.windows.net/lightGBMRanker_train.parquet"
)
# print some basic info
print("records read: " + str(df.count()))
print("Schema: ")
df.printSchema()
display(df.limit(10))

Model Training

from synapse.ml.lightgbm import LightGBMRanker

features_col = "features"
query_col = "query"
label_col = "labels"
lgbm_ranker = LightGBMRanker(
    labelCol=label_col,
    featuresCol=features_col,
    groupCol=query_col,  # rows sharing a query value form one ranking group
    predictionCol="preds",
    leafPredictionCol="leafPreds",
    featuresShapCol="importances",
    repartitionByGroupingColumn=True,  # keep each query's rows on one partition
    numLeaves=32,
    numIterations=200,
    evalAt=[1, 3, 5],  # report NDCG@1, @3, and @5 during training
    metric="ndcg",
)
lgbm_ranker_model = lgbm_ranker.fit(df)

Model Prediction

dt = spark.read.format("parquet").load(
    "wasbs://publicwasb@mmlspark.blob.core.windows.net/lightGBMRanker_test.parquet"
)
predictions = lgbm_ranker_model.transform(dt)
predictions.limit(10).toPandas()
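
To eyeball ranking quality, sort the scored rows within each query by the model's score and compare the ordering against the labels. A minimal sketch using the columns defined above:

from pyspark.sql import functions as F

# Show the highest-scoring documents first within each query
display(
    predictions.select(query_col, label_col, "preds")
    .orderBy(F.col(query_col), F.col("preds").desc())
    .limit(20)
)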